NSF PAR Search | NSF Public Access Repository

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Xie, Chulin; Lin, Zinan; Backurs, Arturs; Gopi, Sivakanth; Yu, Da; Inan, Huseyin A; Nori, Harsha; Jiang, Haotian; Zhang, Huishuai; Lee, Yin Tat; et al. (July 2024, International Conference on Machine Learning (ICML 2024))

Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named AUGPE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that AUGPE produces DP synthetic text whose utility is competitive with state-of-the-art (SOTA) DP finetuning baselines. This underscores the feasibility of relying solely on API access to LLMs to produce high-quality DP synthetic text, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
Full Text Available
Understanding generalization error of SGD in nonconvex optimization

Zhou, Yi; Liang, Yingbin; Zhang, Huishuai (January 2022, Machine Learning)

Full Text Available
Median-Truncated Nonconvex Approach for Phase Retrieval With Outliers

https://doi.org/10.1109/TIT.2018.2847695

Zhang, Huishuai; Chi, Yuejie; Liang, Yingbin (November 2018, IEEE Transactions on Information Theory)

Full Text Available
Non-convex low-rank matrix recovery from corrupted random linear measurements

https://doi.org/10.1109/SAMPTA.2017.8024376

Li, Yuanxin; Chi, Yuejie; Zhang, Huishuai; Liang, Yingbin (July 2017, 2017 International Conference on Sampling Theory and Applications (SampTA))

Recent work has demonstrated the effectiveness of gradient descent for recovering low-rank matrices from random linear measurements in a globally convergent manner. However, its performance is highly sensitive to outliers that may take arbitrary values, which are common in practice. In this paper, we propose a truncated gradient descent algorithm to improve the robustness against outliers, where the truncation is performed to rule out the contributions from samples that deviate significantly from the sample median. A restricted isometry property regarding the sample median is introduced to provide a theoretical footing for the proposed algorithm under the Gaussian orthogonal ensemble. Extensive numerical experiments are provided to validate the superior performance of the proposed algorithm.
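
The truncation idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the positive semidefinite factorization M = U U^T, the squared loss, the threshold multiplier alpha, and the function names keep_mask and median_truncated_gradient are all assumptions made for illustration, and the exact truncation rule in the paper may differ. The sketch only shows how samples whose residuals deviate far from the sample median can be excluded from a gradient step.

```python
import numpy as np


def keep_mask(residuals, alpha=3.0):
    """Mark samples whose |residual| is at most alpha times the sample median.

    alpha is a hypothetical tuning constant, not a value taken from the paper.
    """
    abs_res = np.abs(residuals)
    return abs_res <= alpha * np.median(abs_res)


def median_truncated_gradient(U, A, y, alpha=3.0):
    """Gradient of (1/4m) * sum_i (<A_i, U U^T> - y_i)^2 over kept samples only.

    U: (n, r) current factor; A: (m, n, n) symmetric sensing matrices;
    y: (m,) measurements, some of which may be arbitrary outliers.
    """
    # Residual of each linear measurement under the current estimate U U^T.
    residuals = np.einsum("mij,ij->m", A, U @ U.T) - y
    # Drop samples that deviate significantly from the sample median.
    mask = keep_mask(residuals, alpha)
    # For symmetric A_i, the gradient is (1/m) * sum_i r_i * A_i @ U.
    weighted = np.einsum("m,mij->ij", residuals * mask, A)
    return weighted @ U / len(y)
```

A descent loop under these assumptions would repeat U = U - step * median_truncated_gradient(U, A, y) from a suitable initialization, where step is a user-chosen step size. Because the threshold is tied to the median rather than the mean, a minority of arbitrarily large outliers cannot move it much, which is the robustness property the abstract refers to.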
Full Text Available
Non-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent

https://doi.org/10.1093/imaiai/iaz009

Li, Yuanxin; Chi, Yuejie; Zhang, Huishuai; Liang, Yingbin (May 2019, Information and Inference: A Journal of the IMA)
